Add Mistral Small 4 (119B MoE) support via mistral4.py#1037

Open
ProducerGuy wants to merge 1 commit into ml-explore:main from ProducerGuy:mistral-small-4-moe-support

Conversation

@ProducerGuy

Summary

Adds support for Mistral Small 4 (mistralai/Mistral-Small-4-119B-2603), a 119B-parameter Mixture-of-Experts model with 128 experts and 4 active per token (6B active parameters).

Enables mlx-community/Mistral-Small-4-119B-2603-4bit to load and run.

Before

ValueError: Received 1260 parameters not in model:
language_model.model.layers.0.mlp.gate.weight,
language_model.model.layers.0.mlp.shared_experts.down_proj.weight,
language_model.model.layers.0.mlp.switch_mlp.down_proj.weight,
language_model.model.layers.0.self_attn.kv_a_proj_with_mqa.weight,
...

After

Prompt: 7 tokens, 2.597 tokens-per-sec
Generation: 20 tokens, 105.054 tokens-per-sec
Peak memory: 67.088 GB

Changes

New file: mlx_lm/models/mistral4.py

  • MoE feedforward with SwitchGLU routing (128 experts, top-4 selection)
  • Shared expert support via standard MLP
  • MLA (Multi-head Latent Attention) with explicit kv_b_proj linear layer for KV decompression — this is architecturally distinct from DeepSeek V3's MultiLinear approach; Mistral Small 4 uses a single linear projection rather than per-head Kronecker-style decomposition
  • Standard attention fallback for any dense layers
  • All dimensions (kv_lora_rank, q_lora_rank, qk_rope_head_dim, v_head_dim, etc.) read from config.json, nothing hardcoded
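As a rough illustration of the routing described above, a top-k MoE forward pass with a shared expert can be sketched in plain NumPy. This is not the mlx-lm SwitchGLU implementation; the expert count, dimensions, and the identity shared expert are invented for the example:

```python
# Illustrative top-k MoE routing sketch (NOT the actual mlx-lm code).
# The real model uses 128 experts with top-4 selection; we use a tiny
# configuration here so the example runs instantly.
import numpy as np

def moe_forward(x, gate_w, experts, shared_expert, top_k=4):
    """x: (hidden,); gate_w: (n_experts, hidden); experts: callables."""
    scores = gate_w @ x                      # router logits, one per expert
    top = np.argsort(scores)[-top_k:]        # indices of the top-k experts
    weights = np.exp(scores[top])
    weights /= weights.sum()                 # softmax over the selected experts
    out = sum(w * experts[i](x) for w, i in zip(weights, top))
    return out + shared_expert(x)            # shared expert always contributes

rng = np.random.default_rng(0)
hidden, n_experts = 8, 16
x = rng.standard_normal(hidden)
gate_w = rng.standard_normal((n_experts, hidden))
# Each expert captures its own weight matrix via the default argument.
experts = [lambda v, W=rng.standard_normal((hidden, hidden)): W @ v
           for _ in range(n_experts)]
shared = lambda v: v                         # identity stand-in for the shared MLP
y = moe_forward(x, gate_w, experts, shared)
print(y.shape)  # (8,)
```

Only top_k expert MLPs run per token, which is why a 119B-parameter model has roughly 6B active parameters per forward pass.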
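The MLA decompression path described above can also be sketched with NumPy: hidden states are compressed to a small latent (which is what gets cached), and a single kv_b_proj linear expands the latent back into per-head keys and values. All dimensions below are invented for illustration, since the real values come from config.json, and this is not the actual mistral4.py code:

```python
# Hedged sketch of MLA KV compression/decompression via a single
# kv_b_proj linear layer. Dimensions are illustrative only.
import numpy as np

rng = np.random.default_rng(1)
hidden, kv_lora_rank = 32, 8
n_heads, qk_nope_head_dim, v_head_dim = 4, 6, 6

h = rng.standard_normal(hidden)                     # one token's hidden state
kv_a = rng.standard_normal((kv_lora_rank, hidden))  # compress: hidden -> latent
kv_b = rng.standard_normal((n_heads * (qk_nope_head_dim + v_head_dim),
                            kv_lora_rank))          # decompress in one linear

latent = kv_a @ h                                   # cached instead of full K/V
kv = (kv_b @ latent).reshape(n_heads, qk_nope_head_dim + v_head_dim)
k_nope, v = kv[:, :qk_nope_head_dim], kv[:, qk_nope_head_dim:]
print(k_nope.shape, v.shape)  # (4, 6) (4, 6)
```

Caching the latent rather than the full keys and values is what makes MLA memory-efficient at inference time.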

Modified: mlx_lm/models/mistral3.py (9 lines added, 2 removed)

  • Routes to mistral4.Model when n_routed_experts is present in text_config
  • Structural detection (not model_type string matching) — forward-compatible with future MoE Mistral variants
  • Existing dense Ministral 3B/8B/14B models completely unaffected
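A minimal sketch of the structural detection, with a hypothetical helper name (the actual mistral3.py code differs):

```python
# Hypothetical illustration of structural routing; the function name
# and config dicts are made up, not taken from mlx-lm.
def is_moe_text_config(text_config: dict) -> bool:
    # Route on structure (presence of n_routed_experts), not on the
    # model_type string, so future MoE variants are picked up too.
    return "n_routed_experts" in text_config

dense_cfg = {"model_type": "mistral3", "hidden_size": 4096}
moe_cfg = {"model_type": "mistral3", "n_routed_experts": 128}
print(is_moe_text_config(dense_cfg), is_moe_text_config(moe_cfg))  # False True
```

Checking for a structural key means a dense config falls through to the existing Ministral path untouched, while any config that declares routed experts goes to mistral4.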

Notes

  • apply_chat_template works out of the box once the model loads — the raw prompt test output (JSON-formatted) is expected without the chat template and is not a bug
  • reasoning_effort parameter support is intentionally left for a follow-up — this PR focuses on correct inference only
  • Tested on MacBook Pro M5 Max, 128GB unified memory, macOS 26.3

Test plan

  • Mistral Small 4 (119B MoE) loads without weight key errors
  • Generates correct factual output ("What is the capital of Japan?" → Tokyo)
  • 105 tok/s generation, 67GB peak memory
  • Dense Ministral3 routing still works (class instantiation verified)
  • No changes to existing model files other than routing in mistral3.py

Commit message

Adds MoE + MLA model support for Mistral Small 4
(mistralai/Mistral-Small-4-119B-2603), enabling
mlx-community/Mistral-Small-4-119B-2603-4bit to load and run.

New file: mlx_lm/models/mistral4.py
- MoE feedforward with SwitchGLU routing (128 experts, top-4)
- Shared expert support
- MLA attention with compressed KV via explicit kv_b_proj
  (distinct from DeepSeek V3's MultiLinear approach)
- Standard attention fallback for dense layers
- All dimensions read from config, nothing hardcoded

Modified: mlx_lm/models/mistral3.py
- Structural routing: n_routed_experts presence routes to mistral4
- Forward-compatible with future MoE Mistral variants
- Dense Ministral 3B/8B/14B models unaffected

Tested on MacBook Pro M5 Max (128GB):
- 104 tok/s generation
- 67 GB peak memory
- Correct factual output confirmed

Before: ValueError: Received 1260 parameters not in model
After: Model loads and generates correctly

Chat template (apply_chat_template) works out of the box
once the model loads — no additional changes needed.